
Debug about incremental indexing and also MCP #26

Open
Dennis-yxchen wants to merge 3 commits into main from debug

Conversation

@Dennis-yxchen
Collaborator

No description provided.


Copilot AI left a comment


Pull request overview

This PR tightens incremental indexing and MCP server behaviors to avoid mixed in-memory artifacts, improve repo naming stability for local paths, and eliminate stdout print() noise (important for MCP stdio).

Changes:

  • Add stable local-path repo naming (get_repo_name_from_path) with legacy index reuse and collision detection in MCP flows.
  • Harden indexing/caching flows: rebuild vector/graph state cleanly per repo, improve multi-repo cache loading correctness, and add incremental reindex safeguards (embedding dimension checks, effective repo_root, overview refresh, manifest persistence).
  • Add a comprehensive regression test suite covering incremental reindexing + MCP behaviors and verifying no print() calls in critical paths.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Summary per file:

  • tests/test_incremental_indexing_regressions.py: New regression tests for incremental reindexing, MCP fallback logic, repo naming stability, and stdout safety.
  • mcp_server.py: Canonical repo naming for local paths, a clean full-reindex helper, incremental fallback logic, stricter in-memory load validation, and incremental refresh enabled for lookup tools.
  • fastcode/utils.py: Introduces get_repo_name_from_path() for stable, disambiguated local repo identifiers.
  • fastcode/loader.py: Uses get_repo_name_from_path() to name workspace copies consistently and to support their reuse.
  • fastcode/main.py: Ensures clean per-repo in-memory indexing, fresh graph builder usage, safer cache-load rebuilds, multi-repo cache hardening, and more robust incremental reindexing.
  • fastcode/vector_store.py: Replaces stdout print() with logger debug output in repo overview search.
  • fastcode/retriever.py: Replaces stdout print() with structured logging during repo selection.
  • fastcode/query_processor.py: Removes stdout print() of the LLM response in favor of debug logging.
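The last three entries matter because the MCP stdio transport reserves stdout for the JSON-RPC message stream, so any stray print() corrupts protocol framing. A minimal sketch of stderr-only logging (the logger name is illustrative, not taken from the diff):

```python
import logging
import sys

# Route all diagnostics to stderr so stdout stays reserved for the MCP
# stdio JSON-RPC stream; a stray print() to stdout breaks message framing.
# force=True replaces any handlers configured earlier in the process.
logging.basicConfig(
    stream=sys.stderr,
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    force=True,
)

logger = logging.getLogger("fastcode.vector_store")  # name illustrative
logger.debug("repo overview search matched %d candidates", 3)
```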


Comment thread fastcode/main.py
Comment on lines +1219 to +1225
# failures cannot leave stale repos marked as loaded.
self.loaded_repositories = {}
self.repo_indexed = False
self.repo_loaded = False
self.multi_repo_mode = False
if hasattr(self.retriever, "current_loaded_repos"):
    self.retriever.current_loaded_repos = None

Copilot AI Mar 22, 2026


In _load_multi_repo_cache, the in-memory merge reinitializes self.vector_store and resets self.retriever.current_loaded_repos, but it does not clear HybridRetriever's filtered caches (filtered_vector_store / filtered_bm25*). If a prior request loaded a filtered subset, subsequent queries without an explicit repo_filter can still use the stale filtered_vector_store (because _semantic_search prefers it when non-empty), returning incomplete results after a reload. Clear these filtered caches whenever rebuilding the merged in-memory index (either by calling invalidate_in_memory_state() here or explicitly resetting the filtered fields).

Suggested change
  - # failures cannot leave stale repos marked as loaded.
  - self.loaded_repositories = {}
  - self.repo_indexed = False
  - self.repo_loaded = False
  - self.multi_repo_mode = False
  - if hasattr(self.retriever, "current_loaded_repos"):
  -     self.retriever.current_loaded_repos = None
  + # failures cannot leave stale repos marked as loaded or stale
  + # filtered caches on the retriever.
  + self.loaded_repositories = {}
  + self.repo_indexed = False
  + self.repo_loaded = False
  + self.multi_repo_mode = False
  + # Reset retriever state that depends on the in-memory index
  + if hasattr(self.retriever, "current_loaded_repos"):
  +     self.retriever.current_loaded_repos = None
  + # Clear any filtered vector store used for subset queries
  + if hasattr(self.retriever, "filtered_vector_store"):
  +     self.retriever.filtered_vector_store = None
  + # Clear all BM25-based filtered caches (e.g., filtered_bm25_corpus,
  + # filtered_bm25_index, etc.) so subsequent queries use the fresh index.
  + for attr_name in dir(self.retriever):
  +     if attr_name.startswith("filtered_bm25"):
  +         setattr(self.retriever, attr_name, None)

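The invalidate_in_memory_state() route the comment mentions can be sketched on the retriever itself; the HybridRetriever below is a reduced stand-in with only the fields named in this thread, and the real class holds more state:

```python
class HybridRetriever:
    """Reduced sketch: only the cache fields discussed in the review."""

    def __init__(self):
        self.current_loaded_repos = None
        self.filtered_vector_store = None  # subset index from a prior repo_filter
        self.filtered_bm25_corpus = None
        self.filtered_bm25_index = None

    def invalidate_in_memory_state(self):
        # Drop every cache derived from the merged in-memory index, so a
        # rebuilt index can never be mixed with stale filtered subsets.
        self.current_loaded_repos = None
        self.filtered_vector_store = None
        for attr in [a for a in vars(self) if a.startswith("filtered_bm25")]:
            setattr(self, attr, None)
```

Centralizing the reset in one method keeps _load_multi_repo_cache from having to know which filtered fields exist, which is exactly the drift the reviewer warns about.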
Comment thread mcp_server.py
Comment on lines +220 to +227
def strip_repo_hash_suffix(repo_name: str) -> Optional[str]:
    base_name, sep, suffix = repo_name.rpartition("-")
    if not sep or len(suffix) != 8:
        return None
    if not all(ch in "0123456789abcdef" for ch in suffix.lower()):
        return None
    return base_name or None


Copilot AI Mar 22, 2026


strip_repo_hash_suffix() is implemented twice in this file (once inside _repo_name_from_source and again inside _ensure_repos_ready). Duplicating the parsing rules increases the chance these helpers drift over time (e.g., suffix length/format changes) and creates subtle behavior mismatches. Consider extracting a single module-level helper and reusing it from both call sites.

Suggested change
  - def strip_repo_hash_suffix(repo_name: str) -> Optional[str]:
  -     base_name, sep, suffix = repo_name.rpartition("-")
  -     if not sep or len(suffix) != 8:
  -         return None
  -     if not all(ch in "0123456789abcdef" for ch in suffix.lower()):
  -         return None
  -     return base_name or None

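Once extracted to module level, the helper's contract can be pinned down with a quick usage check (input names are illustrative):

```python
from typing import Optional


def strip_repo_hash_suffix(repo_name: str) -> Optional[str]:
    # Split off a trailing "-<8 hex chars>" suffix, if present; return the
    # base name, or None when the name does not carry such a suffix.
    base_name, sep, suffix = repo_name.rpartition("-")
    if not sep or len(suffix) != 8:
        return None
    if not all(ch in "0123456789abcdef" for ch in suffix.lower()):
        return None
    return base_name or None


# A hashed local-path name maps back to its base; plain names do not.
print(strip_repo_hash_suffix("myrepo-1a2b3c4d"))  # → myrepo
print(strip_repo_hash_suffix("myrepo"))           # → None
print(strip_repo_hash_suffix("myrepo-notahexx"))  # → None
```

With a single module-level definition, a future change to the suffix format (say, a longer digest) only needs to touch one place for both _repo_name_from_source and _ensure_repos_ready to stay in agreement.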